<!DOCTYPE html>
<HTML lang = "en">
<HEAD>
  <meta charset="UTF-8"/>
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
  <title>Basic Parameter Estimation, Reverse-Mode AD, and Inverse Problems</title>
  

  <script type="text/x-mathjax-config">
    MathJax.Hub.Config({
      tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]},
      TeX: { equationNumbers: { autoNumber: "AMS" } }
    });
  </script>

  <script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
  </script>

  
<style>
pre.hljl {
    border: 1px solid #ccc;
    margin: 5px;
    padding: 5px;
    overflow-x: auto;
    color: rgb(68,68,68); background-color: rgb(251,251,251); }
pre.hljl > span.hljl-t { }
pre.hljl > span.hljl-w { }
pre.hljl > span.hljl-e { }
pre.hljl > span.hljl-eB { }
pre.hljl > span.hljl-o { }
pre.hljl > span.hljl-k { color: rgb(148,91,176); font-weight: bold; }
pre.hljl > span.hljl-kc { color: rgb(59,151,46); font-style: italic; }
pre.hljl > span.hljl-kd { color: rgb(214,102,97); font-style: italic; }
pre.hljl > span.hljl-kn { color: rgb(148,91,176); font-weight: bold; }
pre.hljl > span.hljl-kp { color: rgb(148,91,176); font-weight: bold; }
pre.hljl > span.hljl-kr { color: rgb(148,91,176); font-weight: bold; }
pre.hljl > span.hljl-kt { color: rgb(148,91,176); font-weight: bold; }
pre.hljl > span.hljl-n { }
pre.hljl > span.hljl-na { }
pre.hljl > span.hljl-nb { }
pre.hljl > span.hljl-nbp { }
pre.hljl > span.hljl-nc { }
pre.hljl > span.hljl-ncB { }
pre.hljl > span.hljl-nd { color: rgb(214,102,97); }
pre.hljl > span.hljl-ne { }
pre.hljl > span.hljl-neB { }
pre.hljl > span.hljl-nf { color: rgb(66,102,213); }
pre.hljl > span.hljl-nfm { color: rgb(66,102,213); }
pre.hljl > span.hljl-np { }
pre.hljl > span.hljl-nl { }
pre.hljl > span.hljl-nn { }
pre.hljl > span.hljl-no { }
pre.hljl > span.hljl-nt { }
pre.hljl > span.hljl-nv { }
pre.hljl > span.hljl-nvc { }
pre.hljl > span.hljl-nvg { }
pre.hljl > span.hljl-nvi { }
pre.hljl > span.hljl-nvm { }
pre.hljl > span.hljl-l { }
pre.hljl > span.hljl-ld { color: rgb(148,91,176); font-style: italic; }
pre.hljl > span.hljl-s { color: rgb(201,61,57); }
pre.hljl > span.hljl-sa { color: rgb(201,61,57); }
pre.hljl > span.hljl-sb { color: rgb(201,61,57); }
pre.hljl > span.hljl-sc { color: rgb(201,61,57); }
pre.hljl > span.hljl-sd { color: rgb(201,61,57); }
pre.hljl > span.hljl-sdB { color: rgb(201,61,57); }
pre.hljl > span.hljl-sdC { color: rgb(201,61,57); }
pre.hljl > span.hljl-se { color: rgb(59,151,46); }
pre.hljl > span.hljl-sh { color: rgb(201,61,57); }
pre.hljl > span.hljl-si { }
pre.hljl > span.hljl-so { color: rgb(201,61,57); }
pre.hljl > span.hljl-sr { color: rgb(201,61,57); }
pre.hljl > span.hljl-ss { color: rgb(201,61,57); }
pre.hljl > span.hljl-ssB { color: rgb(201,61,57); }
pre.hljl > span.hljl-nB { color: rgb(59,151,46); }
pre.hljl > span.hljl-nbB { color: rgb(59,151,46); }
pre.hljl > span.hljl-nfB { color: rgb(59,151,46); }
pre.hljl > span.hljl-nh { color: rgb(59,151,46); }
pre.hljl > span.hljl-ni { color: rgb(59,151,46); }
pre.hljl > span.hljl-nil { color: rgb(59,151,46); }
pre.hljl > span.hljl-noB { color: rgb(59,151,46); }
pre.hljl > span.hljl-oB { color: rgb(102,102,102); font-weight: bold; }
pre.hljl > span.hljl-ow { color: rgb(102,102,102); font-weight: bold; }
pre.hljl > span.hljl-p { }
pre.hljl > span.hljl-c { color: rgb(153,153,119); font-style: italic; }
pre.hljl > span.hljl-ch { color: rgb(153,153,119); font-style: italic; }
pre.hljl > span.hljl-cm { color: rgb(153,153,119); font-style: italic; }
pre.hljl > span.hljl-cp { color: rgb(153,153,119); font-style: italic; }
pre.hljl > span.hljl-cpB { color: rgb(153,153,119); font-style: italic; }
pre.hljl > span.hljl-cs { color: rgb(153,153,119); font-style: italic; }
pre.hljl > span.hljl-csB { color: rgb(153,153,119); font-style: italic; }
pre.hljl > span.hljl-g { }
pre.hljl > span.hljl-gd { }
pre.hljl > span.hljl-ge { }
pre.hljl > span.hljl-geB { }
pre.hljl > span.hljl-gh { }
pre.hljl > span.hljl-gi { }
pre.hljl > span.hljl-go { }
pre.hljl > span.hljl-gp { }
pre.hljl > span.hljl-gs { }
pre.hljl > span.hljl-gsB { }
pre.hljl > span.hljl-gt { }
</style>



  <style type="text/css">
  @font-face {
  font-style: normal;
  font-weight: 300;
}
@font-face {
  font-style: normal;
  font-weight: 400;
}
@font-face {
  font-style: normal;
  font-weight: 600;
}
html {
  font-family: sans-serif; /* 1 */
  -ms-text-size-adjust: 100%; /* 2 */
  -webkit-text-size-adjust: 100%; /* 2 */
}
body {
  margin: 0;
}
article,
aside,
details,
figcaption,
figure,
footer,
header,
hgroup,
main,
menu,
nav,
section,
summary {
  display: block;
}
audio,
canvas,
progress,
video {
  display: inline-block; /* 1 */
  vertical-align: baseline; /* 2 */
}
audio:not([controls]) {
  display: none;
  height: 0;
}
[hidden],
template {
  display: none;
}
a:active,
a:hover {
  outline: 0;
}
abbr[title] {
  border-bottom: 1px dotted;
}
b,
strong {
  font-weight: bold;
}
dfn {
  font-style: italic;
}
h1 {
  font-size: 2em;
  margin: 0.67em 0;
}
mark {
  background: #ff0;
  color: #000;
}
small {
  font-size: 80%;
}
sub,
sup {
  font-size: 75%;
  line-height: 0;
  position: relative;
  vertical-align: baseline;
}
sup {
  top: -0.5em;
}
sub {
  bottom: -0.25em;
}
img {
  border: 0;
}
svg:not(:root) {
  overflow: hidden;
}
figure {
  margin: 1em 40px;
}
hr {
  -moz-box-sizing: content-box;
  box-sizing: content-box;
  height: 0;
}
pre {
  overflow: auto;
}
code,
kbd,
pre,
samp {
  font-family: monospace, monospace;
  font-size: 1em;
}
button,
input,
optgroup,
select,
textarea {
  color: inherit; /* 1 */
  font: inherit; /* 2 */
  margin: 0; /* 3 */
}
button {
  overflow: visible;
}
button,
select {
  text-transform: none;
}
button,
html input[type="button"], /* 1 */
input[type="reset"],
input[type="submit"] {
  -webkit-appearance: button; /* 2 */
  cursor: pointer; /* 3 */
}
button[disabled],
html input[disabled] {
  cursor: default;
}
button::-moz-focus-inner,
input::-moz-focus-inner {
  border: 0;
  padding: 0;
}
input {
  line-height: normal;
}
input[type="checkbox"],
input[type="radio"] {
  box-sizing: border-box; /* 1 */
  padding: 0; /* 2 */
}
input[type="number"]::-webkit-inner-spin-button,
input[type="number"]::-webkit-outer-spin-button {
  height: auto;
}
input[type="search"] {
  -webkit-appearance: textfield; /* 1 */
  -moz-box-sizing: content-box;
  -webkit-box-sizing: content-box; /* 2 */
  box-sizing: content-box;
}
input[type="search"]::-webkit-search-cancel-button,
input[type="search"]::-webkit-search-decoration {
  -webkit-appearance: none;
}
fieldset {
  border: 1px solid #c0c0c0;
  margin: 0 2px;
  padding: 0.35em 0.625em 0.75em;
}
legend {
  border: 0; /* 1 */
  padding: 0; /* 2 */
}
textarea {
  overflow: auto;
}
optgroup {
  font-weight: bold;
}
table {
  font-family: monospace, monospace;
  font-size : 0.8em;
  border-collapse: collapse;
  border-spacing: 0;
}
td,
th {
  padding: 0;
}
thead th {
    border-bottom: 1px solid black;
    background-color: white;
}
tr:nth-child(odd){
  background-color: rgb(248,248,248);
}


/*
* Skeleton V2.0.4
* Copyright 2014, Dave Gamache
* www.getskeleton.com
* Free to use under the MIT license.
* http://www.opensource.org/licenses/mit-license.php
* 12/29/2014
*/
.container {
  position: relative;
  width: 100%;
  max-width: 960px;
  margin: 0 auto;
  padding: 0 20px;
  box-sizing: border-box; }
.column,
.columns {
  width: 100%;
  float: left;
  box-sizing: border-box; }
@media (min-width: 400px) {
  .container {
    width: 85%;
    padding: 0; }
}
@media (min-width: 550px) {
  .container {
    width: 80%; }
  .column,
  .columns {
    margin-left: 4%; }
  .column:first-child,
  .columns:first-child {
    margin-left: 0; }

  .one.column,
  .one.columns                    { width: 4.66666666667%; }
  .two.columns                    { width: 13.3333333333%; }
  .three.columns                  { width: 22%;            }
  .four.columns                   { width: 30.6666666667%; }
  .five.columns                   { width: 39.3333333333%; }
  .six.columns                    { width: 48%;            }
  .seven.columns                  { width: 56.6666666667%; }
  .eight.columns                  { width: 65.3333333333%; }
  .nine.columns                   { width: 74.0%;          }
  .ten.columns                    { width: 82.6666666667%; }
  .eleven.columns                 { width: 91.3333333333%; }
  .twelve.columns                 { width: 100%; margin-left: 0; }

  .one-third.column               { width: 30.6666666667%; }
  .two-thirds.column              { width: 65.3333333333%; }

  .one-half.column                { width: 48%; }

  /* Offsets */
  .offset-by-one.column,
  .offset-by-one.columns          { margin-left: 8.66666666667%; }
  .offset-by-two.column,
  .offset-by-two.columns          { margin-left: 17.3333333333%; }
  .offset-by-three.column,
  .offset-by-three.columns        { margin-left: 26%;            }
  .offset-by-four.column,
  .offset-by-four.columns         { margin-left: 34.6666666667%; }
  .offset-by-five.column,
  .offset-by-five.columns         { margin-left: 43.3333333333%; }
  .offset-by-six.column,
  .offset-by-six.columns          { margin-left: 52%;            }
  .offset-by-seven.column,
  .offset-by-seven.columns        { margin-left: 60.6666666667%; }
  .offset-by-eight.column,
  .offset-by-eight.columns        { margin-left: 69.3333333333%; }
  .offset-by-nine.column,
  .offset-by-nine.columns         { margin-left: 78.0%;          }
  .offset-by-ten.column,
  .offset-by-ten.columns          { margin-left: 86.6666666667%; }
  .offset-by-eleven.column,
  .offset-by-eleven.columns       { margin-left: 95.3333333333%; }

  .offset-by-one-third.column,
  .offset-by-one-third.columns    { margin-left: 34.6666666667%; }
  .offset-by-two-thirds.column,
  .offset-by-two-thirds.columns   { margin-left: 69.3333333333%; }

  .offset-by-one-half.column,
  .offset-by-one-half.columns     { margin-left: 52%; }

}
html {
  font-size: 62.5%; }
body {
  font-size: 1.5em; /* currently ems cause chrome bug misinterpreting rems on body element */
  line-height: 1.6;
  font-weight: 400;
  font-family: "Raleway", "HelveticaNeue", "Helvetica Neue", Helvetica, Arial, sans-serif;
  color: #222; }
h1, h2, h3, h4, h5, h6 {
  margin-top: 0;
  margin-bottom: 2rem;
  font-weight: 300; }
h1 { font-size: 3.6rem; line-height: 1.2;  letter-spacing: -.1rem;}
h2 { font-size: 3.4rem; line-height: 1.25; letter-spacing: -.1rem; }
h3 { font-size: 3.2rem; line-height: 1.3;  letter-spacing: -.1rem; }
h4 { font-size: 2.8rem; line-height: 1.35; letter-spacing: -.08rem; }
h5 { font-size: 2.4rem; line-height: 1.5;  letter-spacing: -.05rem; }
h6 { font-size: 1.5rem; line-height: 1.6;  letter-spacing: 0; }

p {
  margin-top: 0; }
a {
  color: #1EAEDB; }
a:hover {
  color: #0FA0CE; }
.button,
button,
input[type="submit"],
input[type="reset"],
input[type="button"] {
  display: inline-block;
  height: 38px;
  padding: 0 30px;
  color: #555;
  text-align: center;
  font-size: 11px;
  font-weight: 600;
  line-height: 38px;
  letter-spacing: .1rem;
  text-transform: uppercase;
  text-decoration: none;
  white-space: nowrap;
  background-color: transparent;
  border-radius: 4px;
  border: 1px solid #bbb;
  cursor: pointer;
  box-sizing: border-box; }
.button:hover,
button:hover,
input[type="submit"]:hover,
input[type="reset"]:hover,
input[type="button"]:hover,
.button:focus,
button:focus,
input[type="submit"]:focus,
input[type="reset"]:focus,
input[type="button"]:focus {
  color: #333;
  border-color: #888;
  outline: 0; }
.button.button-primary,
button.button-primary,
input[type="submit"].button-primary,
input[type="reset"].button-primary,
input[type="button"].button-primary {
  color: #FFF;
  background-color: #33C3F0;
  border-color: #33C3F0; }
.button.button-primary:hover,
button.button-primary:hover,
input[type="submit"].button-primary:hover,
input[type="reset"].button-primary:hover,
input[type="button"].button-primary:hover,
.button.button-primary:focus,
button.button-primary:focus,
input[type="submit"].button-primary:focus,
input[type="reset"].button-primary:focus,
input[type="button"].button-primary:focus {
  color: #FFF;
  background-color: #1EAEDB;
  border-color: #1EAEDB; }
input[type="email"],
input[type="number"],
input[type="search"],
input[type="text"],
input[type="tel"],
input[type="url"],
input[type="password"],
textarea,
select {
  height: 38px;
  padding: 6px 10px; /* The 6px vertically centers text on FF, ignored by Webkit */
  background-color: #fff;
  border: 1px solid #D1D1D1;
  border-radius: 4px;
  box-shadow: none;
  box-sizing: border-box; }
/* Removes awkward default styles on some inputs for iOS */
input[type="email"],
input[type="number"],
input[type="search"],
input[type="text"],
input[type="tel"],
input[type="url"],
input[type="password"],
textarea {
  -webkit-appearance: none;
     -moz-appearance: none;
          appearance: none; }
textarea {
  min-height: 65px;
  padding-top: 6px;
  padding-bottom: 6px; }
input[type="email"]:focus,
input[type="number"]:focus,
input[type="search"]:focus,
input[type="text"]:focus,
input[type="tel"]:focus,
input[type="url"]:focus,
input[type="password"]:focus,
textarea:focus,
select:focus {
  border: 1px solid #33C3F0;
  outline: 0; }
label,
legend {
  display: block;
  margin-bottom: .5rem;
  font-weight: 600; }
fieldset {
  padding: 0;
  border-width: 0; }
input[type="checkbox"],
input[type="radio"] {
  display: inline; }
label > .label-body {
  display: inline-block;
  margin-left: .5rem;
  font-weight: normal; }
ul {
  list-style: circle; }
ol {
  list-style: decimal; }
ul ul,
ul ol,
ol ol,
ol ul {
  margin: 1.5rem 0 1.5rem 3rem;
  font-size: 90%; }
li > p {margin : 0;}
th,
td {
  padding: 12px 15px;
  text-align: left;
  border-bottom: 1px solid #E1E1E1; }
th:first-child,
td:first-child {
  padding-left: 0; }
th:last-child,
td:last-child {
  padding-right: 0; }
button,
.button {
  margin-bottom: 1rem; }
input,
textarea,
select,
fieldset {
  margin-bottom: 1.5rem; }
pre,
blockquote,
dl,
figure,
table,
p,
ul,
ol,
form {
  margin-bottom: 1.0rem; }
.u-full-width {
  width: 100%;
  box-sizing: border-box; }
.u-max-full-width {
  max-width: 100%;
  box-sizing: border-box; }
.u-pull-right {
  float: right; }
.u-pull-left {
  float: left; }
hr {
  margin-top: 3rem;
  margin-bottom: 3.5rem;
  border-width: 0;
  border-top: 1px solid #E1E1E1; }
.container:after,
.row:after,
.u-cf {
  content: "";
  display: table;
  clear: both; }

pre {
  display: block;
  padding: 9.5px;
  margin: 0 0 10px;
  font-size: 13px;
  line-height: 1.42857143;
  word-break: break-all;
  word-wrap: break-word;
  border: 1px solid #ccc;
  border-radius: 4px;
}

pre.hljl {
  margin: 0 0 10px;
  display: block;
  background: #f5f5f5;
  border-radius: 4px;
  padding : 5px;
}

pre.output {
  background: #ffffff;
}

pre.code {
  background: #ffffff;
}

pre.julia-error {
  color : red
}

code,
kbd,
pre,
samp {
  font-family: Menlo, Monaco, Consolas, "Courier New", monospace;
  font-size: 0.9em;
}


@media (min-width: 400px) {}
@media (min-width: 550px) {}
@media (min-width: 750px) {}
@media (min-width: 1000px) {}
@media (min-width: 1200px) {}

h1.title {margin-top : 20px}
img {max-width : 100%}
div.title {text-align: center;}

  </style>
</HEAD>

<BODY>
  <div class ="container">
    <div class = "row">
      <div class = "col-md-12 twelve columns">
        <div class="title">
          <h1 class="title">Basic Parameter Estimation, Reverse-Mode AD, and Inverse Problems</h1>
          <h5>Chris Rackauckas</h5>
          <h5>October 22th, 2020</h5>
        </div>

        <p>Have a model. Have data. Fit model to data.</p>
<p>This is a problem that goes under many different names: <em>parameter estimation</em>, <em>inverse problems</em>, <em>training</em>, etc. In this lecture we will go through the methods for how that&#39;s done, starting with the basics and bringing in the recent techniques from machine learning that can be used to improve the basic implementations.</p>
<h2>The Shooting Method for Parameter Fitting</h2>
<p>Assume that we have some model <span class="math">$u = f(p)$</span> where <span class="math">$p$</span> is our parameters, where we put in some parameters and receive our simulated data <span class="math">$u$</span>. How should you choose <span class="math">$p$</span> such that <span class="math">$u$</span> best fits that data? The <em>shooting method</em> directly uses this high level definition of the model by putting a cost function on the output <span class="math">$C(p)$</span>. This cost function is dependent on a user-choice and it&#39;s model-dependent. However, a common one is the L2-loss. If <span class="math">$y$</span> is our expected data, then the L2-loss function against the data is simply:</p>
<p class="math">\[
C(p) = \Vert f(p) - y \Vert
\]</p>
<p>where <span class="math">$C(p): \mathbb{R}^n \rightarrow \mathbb{R}$</span> is a function that returns a scalar. The shooting method then directly optimizes this cost function by having the optimizer generate a data given new choices of <span class="math">$p$</span>.</p>
<h3>Methods for Optimization</h3>
<p>There are many different nonlinear optimization methods which can be used for this purpose, and for a full survey one should look at packages like <a href="https://github.com/JuliaOpt/JuMP.jl">JuMP</a>, <a href="https://github.com/JuliaNLSolvers/Optim.jl">Optim.jl</a>, and <a href="https://github.com/JuliaOpt/NLopt.jl">NLopt.jl</a>.</p>
<p>There are generally two sets of methods: global and local optimization methods. Local optimization methods attempt to find the best nearby extrema by finding a point where the gradient <span class="math">$\frac{dC}{dp} = 0$</span>. Global optimization methods attempt to explore the whole space and find the best of the extrema. Global methods tend to employ a lot more heuristics and are extremely computationally difficult, and thus many studies focus on local optimization. We will focus strictly on local optimization, but one may want to look into global optimization for many applications of parameter estimation.</p>
<p>Most local optimizers make use of derivative information in order to accelerate the solver. The simplest of which is the method of <em>gradient descent</em>. In this method, given a set of parameters <span class="math">$p_i$</span>, the next step of parameters one will try is:</p>
<p class="math">\[
p_{i+1} = p_i - \alpha \frac{dC}{dP}
\]</p>
<p>that is, update <span class="math">$p_i$</span> by walking in the downward direction of the gradient. Instead of using just first order information, one may want to directly solve the rootfinding problem <span class="math">$\frac{dC}{dp} = 0$</span> using Newton&#39;s method. Newton&#39;s method in this case looks like:</p>
<p class="math">\[
p_{i+1} = p_i - (\frac{d}{dp}\frac{dC}{dp})^{-1} \frac{dC}{dp}
\]</p>
<p>But notice that the Jacobian of the gradient is the Hessian, and thus we can rewrite this as:</p>
<p class="math">\[
p_{i+1} = p_i - H(p_i)^{-1} \frac{dC(p_i)}{dp}
\]</p>
<p>where <span class="math">$H(p)$</span> is the Hessian matrix <span class="math">$H_{ij} = \frac{dC}{dx_i dx_j}$</span>. However, solving a system of equations which involves the Hessian can be difficult &#40;just like the Jacobian, but now with another layer of differentiation&#33;&#41;, and thus many optimization techniques attempt to avoid the Hessian. A commonly used technique that is somewhat in the middle is the <em>BFGS</em> technique, which is a gradient-based optimization method that attempts to approximate the Hessian along the way to modify its stepping behavior. It uses the history of previously calculated points in order to build this quick Hessian approximate. If one keeps only a constant length history, say 5 time points, then one arrives at the <em>l-BFGS</em> technique, which is one of the most common large-scale optimization techniques.</p>
<h3>Connection Between Optimization and Differental Equations</h3>
<p>There is actually a strong connection between optimization and differential equations. Let&#39;s say we wanted to follow the gradient of the solution towards a local minimum. That would mean that the flow that we would wish to follow is given by an ODE, specifically the ODE:</p>
<p class="math">\[
p' = -\frac{dC}{dp}
\]</p>
<p>If we apply the Euler method to this ODE, then we recieve</p>
<p class="math">\[
p_{n+1} = p_n - \alpha \frac{dC(p_n)}{dp}
\]</p>
<p>and we thus recover the gradient descent method. Now assume that you want to use implicit Euler. Then we would have the system</p>
<p class="math">\[
p_{n+1} = p_n - \alpha \frac{dC(p_{n+1})}{dp}
\]</p>
<p>which we would then move to one side:</p>
<p class="math">\[
p_{n+1} - p_n + \alpha \frac{dC(p_{n+1})}{dp} = 0
\]</p>
<p>and solve each step via a Newton method. For this Newton method, we need to take the Jacobian of this gradient function, and once again the Hessian arrives as the fundamental quantity.</p>
<h3>Neural Network Training as a Shooting Method for Functions</h3>
<p>A one layer dense neuron is traditionally written as the function:</p>
<p class="math">\[
layer(x) = \sigma.(Wx + b)
\]</p>
<p>where <span class="math">$x \in \mathbb{R}^n$</span>, <span class="math">$W \in \mathbb{R}^{m \times n}$</span>, <span class="math">$b \in \mathbb{R}^{m}$</span> and <span class="math">$\sigma$</span> is some choice of <span class="math">$\mathbb{R}\rightarrow\mathbb{R}$</span> nonlinear function, where the <code>.</code> is the Julia dot to signify element-wise operation.</p>
<p>A traditional <em>neural network</em>, <em>feed-forward network</em>, or <em>multi-layer perceptron</em> is a 3 neuron function, i.e.</p>
<p class="math">\[
NN(x) = W_3 \sigma_2.(W_2\sigma_1.(W_1x + b_1) + b_2) + b_3
\]</p>
<p>where the first layer is called the input layer, the second is called the hidden layer, and the final is called the output layer. This specific function was seen as desirable because of the <em>Universal Approximation Theorem</em>, which is formally stated as follows:</p>
<p>Let <span class="math">$\sigma$</span> be a nonconstant, bounded, and continuous function. Let <span class="math">$I_m = [0,1]^m$</span>. The space of real-valued continuous functions on <span class="math">$I_m$</span> is denoted by <span class="math">$C(I_m)$</span>. For any <span class="math">$\epsilon >0$</span> and any <span class="math">$f\in C(I_m)$</span>, there exists an integer <span class="math">$N$</span>, real constants <span class="math">$W_i$</span> and <span class="math">$b_i$</span> s.t.</p>
<p class="math">\[
\Vert NN(x) - f(x) \Vert < \epsilon
\]</p>
<p>for all <span class="math">$x \in I_m$</span>. Equivalently, <span class="math">$NN$</span> given parameters is dense in <span class="math">$C(I_m)$</span>.</p>
<p>However, it turns out that using only one hidden layer can require exponential growth in the size of said hidden layer, where the size is given by the number of columns in <span class="math">$W_1$</span>. To counteract this, <em>deep neural networks</em> were developed to be in the form of the recurrance relation:</p>
<p class="math">\[
v_{i+1} = \sigma_i.(W_i v_{i} + b_i)
\]</p>
<p class="math">\[
v_1 = x
\]</p>
<p class="math">\[
DNN(x) = v_{n}
\]</p>
<p>for some <span class="math">$n$</span> where <span class="math">$n$</span> is the number of <em>layers</em>. given a sufficient size of the hidden layers, this kind of function is a universal approximator &#40;2017&#41;. Although it&#39;s not quite known yet, some results have shown that this kind of function is able to fit high dimensional functions without the <em>curse of dimensionality</em>, i.e. the number of parameters does not grow exponentially with the input size. More mathematical results in this direction are still being investigated.</p>
<p>However, this theory gives a direct way to transform the fitting of an arbitrary function into a parameter shooting problem. Given an unknown function <span class="math">$f$</span> one wishes to fit, one can place the cost function</p>
<p class="math">\[
C(p) = \Vert DNN(x;p) - f(x) \Vert
\]</p>
<p>where <span class="math">$DNN(x;p)$</span> signifies the deep neural network given by the parameters <span class="math">$p$</span>, where the full set of parameters is the <span class="math">$W_i$</span> and <span class="math">$b_i$</span>. To make the evaluation of that function be practical, we can instead say we wish to evaluate the difference at finitely many points:</p>
<p class="math">\[
C(p) = \sum_k^N \Vert DNN(x_k;p) - f(x_k) \Vert
\]</p>
<p><em>Training</em> a neural network is machine learning speak for finding the <span class="math">$p$</span> which minimizes this cost function. Notice that this is then a shooting method problem, where a cost function is defined by direct evaluations of the model with some choice of parameters.</p>
<h3>Recurrent Neural Networks</h3>
<p>Recurrent neural networks are networks which are given by the recurrance relation:</p>
<p class="math">\[
x_{k+1} = x_k + DNN(x_k,k;p)
\]</p>
<p>Given our machinery, we can see this is equivalent to the Euler discretization with <span class="math">$\Delta t = 1$</span> on the <em>neural ordinary differential equation</em> defined by:</p>
<p class="math">\[
x' = DNN(x,t;p)
\]</p>
<p>Thus a recurrent neural network is a sequence of applications of a neural network &#40;or possibly a neural network indexed by integer time&#41;.</p>
<h2>Computing Gradients</h2>
<p>This shows that many different problems, from training neural networks to fitting differential equations, all have the same underlying mathematical structure which requires the ability to compute the gradient of a cost function given model evaluations. However, this simply reduces to computing the gradient of the model&#39;s output given the parameters. To see this, let&#39;s take for example the L2 loss function, i.e.</p>
<p class="math">\[
C(p) = \sum_i^N \Vert f(x_i;p) - y_i \Vert
\]</p>
<p>for some finite data points <span class="math">$y_i$</span>. In the ODE model, <span class="math">$y_i$</span> are time series points. In the general neural network, <span class="math">$y_i = d(x_i)$</span> for the function we wish to fit <span class="math">$d$</span>. In data science applications of machine learning, <span class="math">$y_i = d_i$</span> the discrete data points we wish to fit. In any of these cases, we see that by the chain rule we have</p>
<p class="math">\[
\frac{dC}{dp} = \sum_i^N 2 \left(f(x_i;p) - y_i \right) \frac{df(x_i)}{dp}
\]</p>
<p>and therefore, knowing how to efficiently compute <span class="math">$\frac{df(x_i)}{dp}$</span> is the essential question for shooting-based parameter fitting.</p>
<h3>Forward-Mode Automatic Differentiation for Gradients</h3>
<p>Let&#39;s recall the forward-mode method for computing gradients. For an arbitrary nonlinear function <span class="math">$f$</span> with scalar output, we can compute derivatives by putting a dual number in. For example, with</p>
<p class="math">\[
d = d_0 + v_1 \epsilon_1 + \ldots + v_m \epsilon_m
\]</p>
<p>we have that</p>
<p class="math">\[
f(d) = f(d_0) + f'(d_0)v_1 \epsilon_1 + \ldots + f'(d_0)v_m \epsilon_m
\]</p>
<p>where <span class="math">$f'(d_0)v_i$</span> is the direction derivative in the direction of <span class="math">$v_i$</span>. To compute the gradient with respond to the input, we thus need to make <span class="math">$v_i = e_i$</span>.</p>
<p>However, in this case we now do not want to compute the derivative with respect to the input&#33; Instead, now we have <span class="math">$f(x;p)$</span> and want to compute the derivatives with respect to <span class="math">$p$</span>. This simply means that we want to take derivatives in the directions of the parameters. To do this, let:</p>
<p class="math">\[
x = x_0 + 0 \epsilon_1 + \ldots + 0 \epsilon_k
\]</p>
<p class="math">\[
P = p + e_1 \epsilon_1 + \ldots + e_k \epsilon_k
\]</p>
<p>where there are <span class="math">$k$</span> parameters. We then have that</p>
<p class="math">\[
f(x;P) = f(x;p) + \frac{df}{dp_1} \epsilon_1 + \ldots + \frac{df}{dp_k} \epsilon_k
\]</p>
<p>as the output, and thus a <span class="math">$k+1$</span>-dimensional number computes the gradient of the function with respect to <span class="math">$k$</span> parameters.</p>
<p>Can we do better?</p>
<h3>The Adjoint Technique and Reverse Accumulation</h3>
<p>The fast method for computing gradients goes under many times. The <em>adjoint technique</em>, <em>backpropogation</em>, and <em>reverse-mode automatic differentiation</em> are in some sense all equivalent phrases given to this method from different disciplines. To understand the adjoint technique, we will look at the multivariate chain rule on a <em>computation graph</em>. Recall that for <span class="math">$f(x(t),y(t))$</span> that we have:</p>
<p class="math">\[
\frac{df}{dt} = \frac{df}{dx}\frac{dx}{dt} + \frac{df}{dy}\frac{dy}{dt}
\]</p>
<p>We can visualize our direct dependences as the computation graph:</p>
<p><img src="https://user-images.githubusercontent.com/1814174/66461367-e3162380-ea46-11e9-8e80-09b32e138269.PNG" alt="" /></p>
<p>i.e. <span class="math">$t$</span> directly determines <span class="math">$x$</span> and <span class="math">$y$</span> which then determines <span class="math">$f$</span>. To calculate Assume you&#39;ve already evaluated <span class="math">$f(t)$</span>. If this has been done, then you&#39;ve already had to calculate <span class="math">$x$</span> and <span class="math">$y$</span>. Thus given the function <span class="math">$f$</span>, we can now calculate <span class="math">$\frac{df}{dx}$</span> and <span class="math">$\frac{df}{dy}$</span>, and then calculate <span class="math">$\frac{dx}{dt}$</span> and <span class="math">$\frac{dy}{dt}$</span>.</p>
<p>Now let&#39;s put another layer in the computation. Let&#39;s make <span class="math">$f(x(v(t),w(t)),y(v(t),w(t))$</span>. We can write out the full expression for the derivative. Notice that even with this additional layer, the statement we wrote above still holds:</p>
<p class="math">\[
\frac{df}{dt} = \frac{df}{dx}\frac{dx}{dt} + \frac{df}{dy}\frac{dy}{dt}
\]</p>
<p>So given an evaluation of <span class="math">$f$</span>, we can &#40;still&#41; directly calculate <span class="math">$\frac{df}{dx}$</span> and <span class="math">$\frac{df}{dy}$</span>. But now, to calculate <span class="math">$\frac{dx}{dt}$</span> and <span class="math">$\frac{dy}{dt}$</span>, we do the next step of the chain rule:</p>
<p class="math">\[
\frac{dx}{dt} = \frac{dx}{dv}\frac{dv}{dt} + \frac{dx}{dw}\frac{dw}{dt}
\]</p>
<p>and similar for <span class="math">$y$</span>. So plug it all in, and you see that our equations will grow wild if we actually try to plug it in&#33; But it&#39;s clear that, to calculate <span class="math">$\frac{df}{dt}$</span>, we can first calculate <span class="math">$\frac{df}{dx}$</span>, and then multiply that to <span class="math">$\frac{dx}{dt}$</span>. If we had more layers, we could calculate the <em>sensitivity</em> &#40;the derivative&#41; of the output to the last layer, then and then the sensitivity to the second layer back is the sensitivity of the last layer multiplied to that, and the third layer back has the sensitivity of the second layer multiplied to it&#33;</p>
<h3>Logistic Regression Example</h3>
<p>To better see this structure, let&#39;s write out a simple example. Let our <em>forward pass</em> through our function be:</p>
<p class="math">\[
\begin{align}
z &= wx + b\\
y &= \sigma(z)\\
\mathcal{L} &= \frac{1}{2}(y-t)^2\\
\mathcal{R} &= \frac{1}{2}w^2\\
\mathcal{L}_{reg} &= \mathcal{L} + \lambda \mathcal{R}\end{align}
\]</p>
<p><img src="https://user-images.githubusercontent.com/1814174/66462825-e2cb5780-ea49-11e9-9804-240037fb6b56.PNG" alt="" /></p>
<p>The formulation of the program here is called a <em>Wengert list, tape, or graph</em>. In this, <span class="math">$x$</span> and <span class="math">$t$</span> are inputs, <span class="math">$b$</span> and <span class="math">$W$</span> are parameters, <span class="math">$z$</span>, <span class="math">$y$</span>, <span class="math">$\mathcal{L}$</span>, and <span class="math">$\mathcal{R}$</span> are intermediates, and <span class="math">$\mathcal{L}_{reg}$</span> is our output.</p>
<p>This is a simple univariate logistic regression model. To do logistic regression, we wish to find the parameters <span class="math">$w$</span> and <span class="math">$b$</span> which minimize the distance of <span class="math">$\mathcal{L}_{reg}$</span> from a desired output, which is done by computing derivatives.</p>
<p>Let&#39;s calculate the derivatives with respect to each quantity in reverse order. If our program is <span class="math">$f(x) = \mathcal{L}_{reg}$</span>, then we have that</p>
<p class="math">\[
\frac{df}{\mathcal{L}_{reg}} = 1
\]</p>
<p>as the derivatives of the last layer. To computerize our notation, let&#39;s write</p>
<p class="math">\[
\overline{\mathcal{L}_{reg}} = \frac{df}{\mathcal{L}_{reg}}
\]</p>
<p>for our computed values. For the derivatives of the second to last layer, we have that:</p>
<p class="math">\[
\begin{align}
  \overline{\mathcal{R}} &= \frac{df}{\mathcal{L}_{reg}} \frac{d\mathcal{L}_{reg}}{\mathcal{R}}\\
                         &= \overline{\mathcal{L}_{reg}} \lambda \end{align}
\]</p>
<p class="math">\[
\begin{align}
 \overline{\mathcal{L}} &= \frac{df}{\mathcal{L}_{reg}} \frac{d\mathcal{L}_{reg}}{\mathcal{L}}\\
                        &= \overline{\mathcal{L}_{reg}} \end{align}
\]</p>
<p>This was our observation from before that the derivative of the second layer is the partial derivative of the current values times the sensitivity of the final layer. And then we keep multiplying, so now for our next layer we have that:</p>
<p class="math">\[
\begin{align}
  \overline{y} &= \overline{\mathcal{L}} \frac{d\mathcal{L}}{dy}\\
               &= \overline{\mathcal{L}} (y-t) \end{align}
\]</p>
<p>And notice that the chain rule holds since <span class="math">$\overline{\mathcal{L}}$</span> implicitly already has the multiplication by <span class="math">$\overline{\mathcal{L}_{reg}}$</span> inside of it. Then the next layer is:</p>
<p class="math">\[
\begin{align}
 \frac{df}{z} &= \overline{y} \frac{dy}{dz}\\
              &= \overline{y} \sigma^\prime(z) \end{align}
\]</p>
<p>Then the next layer. Notice that here, by the chain rule on <span class="math">$w$</span> we have that:</p>
<p class="math">\[
\begin{align}
  \overline{w} &= \overline{z} \frac{\partial z}{\partial w} + \overline{\mathcal{R}} \frac{d \mathcal{R}}{dw}\\
               &= \overline{z} x + \overline{\mathcal{R}} w\end{align}
\]</p>
<p class="math">\[
\begin{align}
 \overline{b} &= \overline{z} \frac{\partial z}{\partial b}\\
              &= \overline{z} \end{align}
\]</p>
<p>This completely calculates all derivatives. In conclusion, the rule is:</p>
<ul>
<li><p>You sum terms from each outward arrow</p>
</li>
<li><p>Each arrow has the derivative term of the end times the partial of the current term.</p>
</li>
<li><p>Recurse backwards to build simple linear combination expressions.</p>
</li>
</ul>
<p>You can thus think of the relations as a message passing relation in reverse to the forward pass:</p>
<p><img src="https://user-images.githubusercontent.com/1814174/66466679-1b226400-ea51-11e9-9e3c-5cc7939c243b.PNG" alt="" /></p>
<p>Note that the reverse-pass has the values of the forward pass, like <span class="math">$x$</span> and <span class="math">$t$</span>, embedded within it.</p>
<h3>Backpropogation of a Neural Network</h3>
<p>Now let&#39;s look at backpropgation of a deep neural network. Before getting to it in the linear algebraic sense, let&#39;s write everything in terms of scalars. This means we can write a simple neural network as:</p>
<p class="math">\[
\begin{align}
  z_i &= \sum_j W_{ij}^1 x_j + b_i^1\\
  h_i &= \sigma(z_i)\\
  y_i &= \sum_j W_{ij}^2 h_j + b_i^2\\
  \mathcal{L} &= \frac{1}{2} \sum_k \left(y_k - t_k \right)^2 \end{align}
\]</p>
<p>where I have chosen the L2 loss function. This is visualized by the computational graph:</p>
<p><img src="https://user-images.githubusercontent.com/1814174/66464817-ad286d80-ea4d-11e9-9a4c-f7bcf1b34475.PNG" alt="" /></p>
<p>Then we can do the same process as before to get:</p>
<p class="math">\[
\begin{align}
  \overline{\mathcal{L}} &= 1\\
  \overline{y_i} &= \overline{\mathcal{L}} (y_i - t_i)\\
  \overline{w_{ij}^2} &= \overline{y_i} h_i\\
  \overline{b_i^2} &= \overline{y_i}\\
  \overline{h_i} &= \sum_k (\overline{y_k}w_{ki}^2)\\
  \overline{z_i} &= \overline{h_i}\sigma^\prime(z_i)\\
  \overline{w_{ij}^1} &= \overline{z_i} x_j\\
  \overline{b_i^1} &= \overline{z_i}\end{align}
\]</p>
<p>just by examining the computation graph. Now let&#39;s write this in linear algebraic form.</p>
<p><img src="https://user-images.githubusercontent.com/1814174/66465741-69366800-ea4f-11e9-9c20-07806214008b.PNG" alt="" /></p>
<p>The forward pass for this simple neural network was:</p>
<p class="math">\[
\begin{align}
  z &= W_1 x + b_1\\
  h &= \sigma(z)\\
  y &= W_2 h + b_2\\
  \mathcal{L} = \frac{1}{2} \Vert y-t \Vert^2 \end{align}
\]</p>
<p>If we carefully decode our scalar expression, we see that we get the following:</p>
<p class="math">\[
\begin{align}
  \overline{\mathcal{L}} &= 1\\
  \overline{y} &= \overline{\mathcal{L}}(y-t)\\
  \overline{W_2} &= \overline{y}h^{T}\\
  \overline{b_2} &= \overline{y}\\
  \overline{h} &= W_2^T \overline{y}\\
  \overline{z} &= \overline{h} .* \sigma^\prime(z)\\
  \overline{W_1} &= \overline{z} x^T\\
  \overline{b_1} &= \overline{z} \end{align}
\]</p>
<p>We can thus decode the rules as:</p>
<ul>
<li><p>Multiplying by the matrix going forwards means multiplying by the transpose going backwards. A term on the left stays on the left, and a term on the right stays on the right.</p>
</li>
<li><p>Element-wise operations give element-wise multiplication</p>
</li>
</ul>
<p>Notice that the summation is then easily encoded into this rule by the transpose operation.</p>
<p>We can write it in the general DNN form of:</p>
<p class="math">\[
r_i = W_i v_{i} + b_i
\]</p>
<p class="math">\[
v_{i+1} = \sigma_i.(r_i)
\]</p>
<p class="math">\[
v_1 = x
\]</p>
<p class="math">\[
\mathcal{L} = \frac{1}{2} \Vert v_{n} - t \Vert
\]</p>
<p class="math">\[
\begin{align}
  \overline{\mathcal{L}} &= 1\\
  \overline{v_n} &= \overline{\mathcal{L}}(y-t)\\
  \overline{r_i} &= \overline{v_i} .* \sigma_i^\prime (r_i)\\
  \overline{W_i} &= \overline{v_i}r_{i-1}^{T}\\
  \overline{b_i} &= \overline{v_i}\\
  \overline{v_{i-1}} &= W_{i}^{T} \overline{v_i} \end{align}
\]</p>
<h3>Reverse-Mode Automatic Differentiation and vjps</h3>
<p>Backpropogation of a neural network is thus a different way of accumulating derivatives. If <span class="math">$f$</span> is a composition of <span class="math">$L$</span> functions:</p>
<p class="math">\[
f = f^L \circ f^{L-1} \circ \ldots \circ f^1
\]</p>
<p>Then the Jacobian matrix satisfies:</p>
<p class="math">\[
J = J_L J_{L-1} \ldots J_1
\]</p>
<p>A program is essentially a nice way of writing a function in composition form. Forward-mode automatic differentiation worked by propogating forward the actions of the Jacobians at every step of the program:</p>
<p class="math">\[
Jv = J_L (J_{L-1} (\ldots (J_1 v) \ldots ))
\]</p>
<p>effectively calculating the Jacobian of the program by multiplying by the Jacobians from left to right at each step of the way. This means doing primitive <span class="math">$Jv$</span> calculations on each underlying problem, and pushing that calculation through.</p>
<p>But what about reverse accumulation? This can be isolated to the simple expression graph:</p>
<p><img src="https://user-images.githubusercontent.com/1814174/66471491-650f4800-ea59-11e9-9b42-4b32d0d0d76f.PNG" alt="" /></p>
<p>In backpropogation, we just showed that when doing reverse accumulation, the rule is that multiplication forwards is multiplication by the transpose backwards. So if the forward way to compute the Jacobian in reverse is to replace the matrix by its transpose:</p>
<p><img src="https://user-images.githubusercontent.com/1814174/66471687-c6cfb200-ea59-11e9-9b80-f9206ffda87f.PNG" alt="" /></p>
<p>We can either look at it as <span class="math">$J^T v$</span>, or by transposing the equation <span class="math">$v^T J$</span>. It&#39;s right there that we have a vector-transpose Jacobian product, or a <em>vjp</em>.</p>
<p>We can thus think of this as a different direction for the Jacobian accumulation. Reverse-mode automatic differentiation moves backwards through our composed Jacobian. For a value <span class="math">$v$</span> at the end, we can push it backwards:</p>
<p class="math">\[
v^T J = (\ldots ((v^T J_L) J_{L-1}) \ldots ) J_1
\]</p>
<p>doing a vjp at every step of the way, which is simply doing reverse-mode AD of that function &#40;and if it&#39;s linear, then simply doing the matrix multiplication&#41;. Thus reverse-mode AD is just a grouping of vjps into a single larger expression, instead of linearizing every single step.</p>
<h3>Primitives of Reverse Mode</h3>
<p>For forward-mode AD, we saw that we could define primitives in order to accelerate the calculation. For example, knowing that</p>
<p class="math">\[
exp(x+\epsilon) = exp(x) + exp(x)\epsilon
\]</p>
<p>allows the program to skip autodifferentiating through the code for <code>exp</code>. This was simple with forward-mode since we could represent the operation on a Dual number. What&#39;s the equivalent for reverse-mode AD? The answer is the <em>pullback</em> function. If <span class="math">$y = [y_1,y_2,\ldots] = f(x_1,x_2, \ldots)$</span>, then <span class="math">$[\overline{x_1},\overline{x_2},\ldots]=\mathcal{B}_f^x(\overline{y})$</span> is the pullback of <span class="math">$f$</span> at the point <span class="math">$x$</span>, defined for a scalar loss function <span class="math">$L(y)$</span> as:</p>
<p class="math">\[
\overline{x_i} = \frac{\partial L}{\partial x} = \sum_i \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial x_i}
\]</p>
<p>Using the notation from earlier, <span class="math">$\overline{y} = \frac{\partial L}{\partial y}$</span> is the derivative of the some intermediate w.r.t. the cost function, and thus</p>
<p class="math">\[
\overline{x_i} = \sum_i \overline{y_i} \frac{\partial y_i}{\partial x_i} = \mathcal{B}_f^x(\overline{y})
\]</p>
<p>Note that <span class="math">$\mathcal{B}_f^x(\overline{y})$</span> is a function of <span class="math">$x$</span> because the reverse pass that is use embeds values from the forward pass, and the values from the forward pass to use are those calculated during the evaluation of <span class="math">$f(x)$</span>.</p>
<p>By the chain rule, if we don&#39;t have a primitive defined for <span class="math">$y_i(x)$</span>, we can compute that by <span class="math">$\mathcal{B}_{y_i}(\overline{y})$</span>, and recursively apply this process until we hit rules that we know. The rules to start with are the scalar derivative rules with follow quite simply, and the multivariate rules which we derived above. For example, if <span class="math">$y=f(x)=Ax$</span>, then</p>
<p class="math">\[
\mathcal{B}_{f}^x(\overline{y}) = \overline{y}^T A
\]</p>
<p>which is simply saying that the Jacobian of <span class="math">$f$</span> at <span class="math">$x$</span> is <span class="math">$A$</span>, and so the vjp is to multiply the vector transpose by <span class="math">$A$</span>.</p>
<p>Likewise, for element-wise operations, the Jacobian is diagonal, and thus the vjp is multiplying once again by a diagonal matrix against the derivative, deriving the same pullback as we had for backpropogation in a neural network. This then is a quicker encoding and derivation of backpropogation.</p>
<h3>Multivariate Derivatives from Reverse Mode</h3>
<p>Since the primitive of reverse mode is the vjp, we can understand its behavior by looking at a large primitive. In our simplest case, the function <span class="math">$f(x)=Ax$</span> outputs a vector value, which we apply our loss function <span class="math">$L(y) = \Vert y-t \Vert$</span> to get a scalar. Thus we seed the scalar output <span class="math">$v=1$</span>, and in the first step backwards we have a vector to scalar function, so the first pullback transforms from <span class="math">$1$</span> to the vector <span class="math">$v_2 = 2|y-t|$</span>. Then we take that vector and multiply it like <span class="math">$v_2^T A$</span> to get the derivatives w.r.t. <span class="math">$x$</span>.</p>
<p>Now let <span class="math">$L(y)$</span> be a vector function, i.e. we output a vector instead of a scalar from our loss function. Then <span class="math">$v$</span> is the <em>seed</em> to this process. Let&#39;s assume that <span class="math">$v = e_i$</span>, one of the basis vectors. Then</p>
<p class="math">\[
v_i^T J = e_i^T J
\]</p>
<p>pulls computes a row of the Jacobian. There, if we had a vector function <span class="math">$y=f(x)$</span>, the pullback <span class="math">$\mathcal{B}_f^x(e_i)$</span> is the row of the Jacobian <span class="math">$f'(x)$</span>. Concatenating these is thus a way to build a full Jacobian. The gradient is thus a special case where <span class="math">$y$</span> is scalar, and thus the resulting Jacobian is just a single row, and therefore we set the seed equal to <span class="math">$1$</span> to compute the unscaled gradient.</p>
<h3>Multi-Seeding</h3>
<p>Similarly to forward-mode having a dual number with multiple simultanious derivatives through partials <span class="math">$d = x + v_1 \epsilon_1 + \ldots + v_m \epsilon_m$</span>, one can see that multi-seeding is an option in reverse-mode AD by, instead of pulling back a matrix instead of a row vector, where each row is a direction. Thus the matrix <span class="math">$A = [v_1 v_2 \ldots v_n]^T$</span> evaluated as <span class="math">$\mathcal{B}_f^x(A)$</span> is the equivalent operation to the forward-mode <span class="math">$f(d)$</span> for generalized multivariate multiseeded reverse-mode automatic differentiation. One should take care to recognize the Jacobian as a generalized linear operator in this case and ensure that the shapes in the program correctly handle this storage of the reverse seed. When linear, this will automatically make use of BLAS3 operations, making it an efficient form for neural networks.</p>
<h3>Sparse Reverse Mode AD</h3>
<p>Since the Jacobian is built row-by-row with reverse mode AD, the sparse differentiation disucssion from forward-mode AD applies similarly but to the transpose. Therefore, in order to perform sparse reverse mode automatic differentiation, one would build up a connectivity graph of the columns, and perform a coloring algorithm on this graph. The seeds of the reverse call, <span class="math">$v_i$</span>, would then be the color vectors, which would compute compressed rows, that are then decompressed similarly to the forward-mode case.</p>
<h3>Forward Mode vs Reverse Mode</h3>
<p>Notice that a pullback of a single scalar gives the gradient of a function, while the <em>pushforward</em> using forward-mode of a dual gives a directional derivative. Forward mode computes columns of a Jacobian, while reverse mode computes gradients &#40;rows of a Jacobian&#41;. Therefore, the relative efficiency of the two approaches is based on the size of the Jacobian. If <span class="math">$f:\mathbb{R}^n \rightarrow \mathbb{R}^m$</span>, then the Jacobian is of size <span class="math">$m \times n$</span>. If <span class="math">$m$</span> is much smaller than <span class="math">$n$</span>, then computing by each row will be faster, and thus use reverse mode. In the case of a gradient, <span class="math">$m=1$</span> while <span class="math">$n$</span> can be large, leading to this phonomena. Likewise, if <span class="math">$n$</span> is much smaller than <span class="math">$m$</span>, then computing by each column will be faster. We will see shortly the reverse mode AD has a high overhead with respect to forward mode, and thus if the values are relatively equal &#40;or <span class="math">$n$</span> and <span class="math">$m$</span> are small&#41;, forward mode is more efficient.</p>
<p>However, since optimization needs gradients, reverse-mode definitely has a place in the standard toolchain which is why backproogation is so central to machine learning.</p>
<h3>Side Note on Mixed Mode</h3>
<p>Interestingly, one can find cases where mixing the forward and reverse mode results would give an asymtopically better result. For example, if a Jacobian was non-zero in only the first 3 rows and first 3 columns, then sparse forward mode would still require N partials and reverse mode would require M seeds. However, one forward mode call of 3 partials and one reverse mode call of 3 seeds would calculate all three rows and columns with <span class="math">$\mathcal{O}(1)$</span> work, as opposed to <span class="math">$\mathcal{O}(N)$</span> or <span class="math">$\mathcal{O}(M)$</span>. Exactly how to make use of this insight in an automated manner is an open research question.</p>
<h3>Forward-Over-Reverse and Hessian-Free Products</h3>
<p>Using this knowledge, we can also develop quick ways for computing the Hessian. Recall from earlier in the discussion that Hessians are the Jacobian of the gradient. So let&#39;s say for a scalar function <span class="math">$f$</span> we want to compute the Hessian. To compute the gradient, we use the reverse-mode AD pullback <span class="math">$\nabla f(x) = \mathcal{B}_f^x(1)$</span>. Recall that the pullback is a function of <span class="math">$x$</span> since that is the value at which the values from the forward pass are taken. Then since the Jacobian of the gradient vector is <span class="math">$n \times n$</span> &#40;as many terms in the gradient as there are inputs&#33;&#41;, it holds that we want to use forward-mode AD for this Jacobian. Therefore, using the dual number <span class="math">$x = x_0 + e_1 \epsilon_1 + \ldots + e_n \epsilon_n$</span> the reverse mode gradient function computes the full Hessian in one forward pass. What this amounts to is pushing forward the dual number forward sensitivities when building the pullback, and then when doing the pullback the dual portions, will be holding vectors for the columns of the Hessian.</p>
<p>Similarly, Hessian-vector products without computing the Hessian can be computed using the Jacobian-vector product trick on the function defined by the gradient. Here, <span class="math">$Hv$</span> is equivalent to the dual part of</p>
<p class="math">\[
\nabla f(x+v\epsilon) = \mathcal{B}_f^{x+v\epsilon}(1)
\]</p>
<p>This means that our Newton method for optimization:</p>
<p class="math">\[
p_{i+1} = p_i - H(p_i)^{-1} \frac{dC(p_i)}{dp}
\]</p>
<p>can be treated similarly to that for the nonlinear solving problem, where the linear system can be solved using Hessian-free vector products to build a Krylov subspace, giving rise to the <em>Hessian-free Newton Krylov</em> method for optimization.</p>
<h3>References</h3>
<p>We thank <a href="https://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/slides/lec06.pdf">Roger Grosse&#39;s lecture notes</a> for the amazing tikz graphs.</p>


        <HR/>
        <div class="footer">
          <p>
            Published from <a href="estimation_identification.jmd">estimation_identification.jmd</a>
            using <a href="http://github.com/JunoLab/Weave.jl">Weave.jl</a> v0.10.6 on 2020-10-22.
          </p>
        </div>
      </div>
    </div>
  </div>
</BODY>

</HTML>
